Exploring population genetics of EBV genomes in BL endemic regions

Here, I would like to share some of my results exploring the single nucleotide variation data of EBV genomes we sequenced. All the code and data have been deposited to the repo https://github.com/yasinkaymaz/ViralGenomeAssembly Note: Directories are relative to repo master.

We are loading SNV data of EBV genomes from 58 eBL and 40 Healthy controls, excluding the plasma genomes of paired samples and replicated genomes.

#Load the data
pops <- read.delim(here::here("workspace/data/pop_info.txt"),row.names = 1, header = T)
summary(pops)
##                Method          EBV_Type           Condition 
##  Direct Sequencing:33   Inter-typic: 3   eBL           :58  
##  GenomiPhi-WGA    :13   Type1      :60   HealthyControl:40  
##  Preamp-sWGA      :52   Type2      :35

I generated a vcf file out of De Novo assembled genomes following the steps:

  1. multiple sequence alignment (scripts/msa_mafft.sh)

  2. Fixed the insertions/deletions (bin/MSA_InsertCleaner.py & bin/MSA_Gap-Singleton_Editor.py)

  3. Determined single nucleotide variant locations (bin/MSA_SNP_finder.py)

  4. Removed repetitive regions (bin/MSA_cleaner.py)

  5. Finally, generated a vcf file from the output using ‘snp-sites’ tool.

Resulting vcf file stores 9488 variant locations from 98 genomes.

## Scanning file to determine attributes.
## File attributes:
##   meta lines: 3
##   header_line: 4
##   variant count: 7090
##   column count: 107
## 
Meta line 3 read in.
## All meta lines processed.
## gt matrix initialized.
## Character matrix gt created.
##   Character matrix gt rows: 7090
##   Character matrix gt cols: 107
##   skip: 0
##   nrows: 7090
##   row_num: 0
## 
Processed variant 1000
Processed variant 2000
Processed variant 3000
Processed variant 4000
Processed variant 5000
Processed variant 6000
Processed variant 7000
Processed variant: 7090
## All variants processed

Need to convert vcf file into genlight (gl) format. This conversion drops 200 loci with more than two alleles, which are not currently supported. What is left is 8249 loci from 98 genomes.

I then want to remove genomes with too many missing data, in other words, short assemblies due to lack of coverage. Remove individual genomes if only less than 25% of all variants locations are covered. I also only keep loci at which at least 25% of genomes have been covered.

## Starting gl.filter.callrate: Filtering on Call Rate
##   Removing loci based on Call Rate, threshold = 0.25 
## gl.filter.callrate completed
## Starting gl.filter.callrate: Filtering on Call Rate
##   Removing individuals based on Call Rate, threshold t = 0.25 
##   List of individuals deleted because of low call rate: HC-0031 HC-0033 HC-0038 from populations:  
## Starting gl.filter.monomorphs: Deleting monomorphic loci
##   Deleting monomorphic loci and loci with all NA scores
## Completed gl.filter.monomorphs
## 
## Starting gl.recalc.metrics: Recalculating locus metrics
## Starting utils.recalc.avgpic: Recalculating OneRatioRef, OneRatioSnp, PICRef, PICSnp and AvgPIC
## Completed utils.recalc.avgpic
## 
## Starting utils.recalc.callrate: Recalculating CallRate
## Completed utils.recalc.callrate
## 
## Starting utils.recalc.freqhets: Recalculating frequency of heterozygotes
## Completed utils.recalc.freqhets
## 
## Starting utils.recalc.freqhomref: Recalculating frequency of homozygotes, reference allele
## Completed utils.recalc.freqhomref
## 
## Starting utils.recalc.freqhomref: Recalculating frequency of homozygotes, alternate allele
## Completed utils.recalc.freqhomsnp
## 
## Note: Locus metrics recalculated
## Completed gl.recalc.metrics
## 
## gl.filter.callrate completed

This results to drop 3 healthy control genomes with low coverage; “HC-0031”, “HC-0033”, “HC-0038”.

We are left with 7458 variants from 95 genomes in total now.

Principle Coordinates Analysis (PCoA)

Now, I run PCOA analysis on the data generating first 20 PCs:

## Performing a PCoA, individuals as entities, SNP loci as attributes
## Ordination yielded 13 informative dimensions from 94 original dimensions
##   PCoA Axis 1 explains 43.9 % of the total variance
##   PCoA Axis 1 and 2 combined explain 52 % of the total variance
##   PCoA Axis 1-3 combined explain 57.2 % of the total variance

Here is the PCOA of 1st and 2nd axis with all variants.

PCoAxis 2nd and 3rd:

PCoAxis 4nd and 5rd:

Then, I plot the PCA loading values for each genomic location

df <- as.data.frame(pc$loadings)
ggplot(df, aes(x=as.numeric(gl@position),y=Axis1, group = 1))+
  geom_point(size=.1)+
  labs(x="genomic positions")+
  ggtitle("PCA Loadings from Axis 1")

Calculating Fst for variant loci

Here, I wanted to calculate locus-by-locus Fst scores to determine accumulations of genomics regions that reflect differentiation between the two comparison groups. Fixation index ranges between 0 and 1, where 0 means complete exchange of genetic material while 1 means no sharing and complete differentiation between the groups.

First, I am plotting the locus-by-locus Fst for Type 1 and Type 2 genomes regardless of disease status:

This, second one is for eBL associated EBV genomes and Healthy control genomes:

Genetic distance

As we discussed, I am now calculating the pairwise genetic distance of genomes for each gene and then take the average. Mean pairwise distances calculated using Kimura 2 parameter method. Error bars represent standard error of mean.

    Kimura 2-Parameter distance = -0.5 log( (1 - 2p -q) * sqrt( 1 - 2q ) )
    where:
    p = transition frequency
    q = transversion frequency

Type 1 and Type 2 genomes separately:

Delta d:

dN/dS over protein coding genes

I took the main python functions from https://github.com/a1ultima/hpcleap_dnds/

            dS  = -(3./4.)*math.log(1.-((4./3.)*pS))
            dN  = -(3./4.)*math.log(1.-((4./3.)*pN))
            
            where; 
                  pN: The count of the number of nonsynonymous differences (Nd) normalized by the number of all possible nonsynonymous sites (N)
                  pS: The count of the number of synonymous differences (Sd) normalized by the number of all possible synonymous sites (S)
                  
            dN_dS = dN/dS

Here is the source for equations from Mega software

Type 1 and Type 2 genomes separately: